Method and device for generating and detecting a fingerprint functioning as a trigger marker
in a multimedia signal 






The invention relates to a method, and a corresponding device, of detecting 
trigger instants/information in a multimedia signal. The invention also relates to a method, 
and a corresponding device, of associating trigger information with a multimedia signal. 
Further, the invention relates to a computer readable medium having stored thereon 
instructions for causing one or more processing units to execute the method according to the 
invention. 



A current trend is to enhance passive television viewing and/or music listening 
on a given playback device by creating more interactive programs and/or listening 
experiences or by "connecting" external actions to a piece of video and/or audio content. As 
one simple example, a commercial can be enhanced by embedding a URL to a web site with 
further information, where the URL can be extracted and retrieved by the playback device. In
order to facilitate such a function it is necessary to enable a reliable detection of time points 
in a television program, a movie, a music piece, etc. where such additional information is 
relevant. 

Examples of situations where such additional information is useful or
interesting in connection with a broadcast program are:

- trigg&link: (see e.g. W. ten Kate et al., "trigg&link - A new dimension in television
program making", Lecture Notes in Computer Science, vol. 1242, pp. 51-65, Springer,
1997) trigg&link allows interactivity in television programs. In addition to the normal
program, additional information concerning specific segments of the program is
available to the viewer through a different distribution channel. At the start of a given
segment that is associated with an enhancement (additional information) an icon is
displayed, alerting the viewer that additional information may be viewed on his TV.
For instance, at the appearance of an actor in a movie, some biographic data of the
actor may be made available. In the user terminal (e.g. a set top box, etc.) the icons
are overlaid on the video at the relevant time instants, thereby requiring these
instants to have been indicated in the video stream.
- Local Insertion: During a national broadcast, specific parts of the program may be
replaced by a regional program in some regions. For instance, some advertisements
may be replaced by advertisements for local shops, or, in a news show, some regions
may have their local weather forecast rather than the national one. The national
program producer can indicate which segments are suitable for such local insertion.
At the local redistribution site (e.g. at the cable head-end), the indicated segments
may be replaced by local content.
In both of the above situations, it is necessary to mark or associate specific
time instants in the video stream at which additional information should be available. At
these time instants the receiver should be triggered to perform or provide some kind of
action. This may be done by such mechanisms as DSM-CC in MPEG/DVB. However, this
requires the broadcaster's cooperation to insert these triggers, thereby making an
enhancement service provider dependent on the broadcaster.

One previously known way of performing time marking in a video stream is 
e.g. using fields of the MPEG transport stream structure that can be used to hold the marking 
information. 

Another previously known way is using a blanking interval. In analog 
distribution, the marking information can be embedded in the vertical blanking interval or in
the inactive video lines. 

Both of the above known ways need the cooperation of all actors in the 
broadcast chain to make sure that the marking information is not destroyed before the signal 
arrives at its destination. For instance, in case of the MPEG solution, a re-multiplexing 
operation could easily remove information that is written in the user data fields in the stream.
Moreover, every decoding and successive re-encoding step would certainly not retain this 
information. In case of the use of the vertical blanking for carrying the trigger information, 
the situation is even more difficult, as actors in the broadcast chain might write other 
information at the same position (the vertical blanking is used for many things and there is no 
uniform agreement about the control over usage of the blanking interval). Also, standards
converters (like PAL-NTSC) and other equipment in the broadcast chain may not retain all 
information in the vertical blanking interval. 

Yet another way is using watermarking. A watermark may be embedded in the 
video frames at the relevant time instants. The Philips Watercast System is, among others,
being sold for this purpose. A disadvantage of watermarking is the fact that it necessarily
changes the video/audio.



It is an object of the invention to provide a method and corresponding device
of relating one or more trigger actions with a multimedia signal, and a corresponding method
and device for detecting one or more trigger actions in a multimedia signal, that solve the
above-mentioned problems. A further object is to provide this in a simple and efficient way.
Another object is to enable simple, reliable and accurate localisation of a given part of a
multimedia signal. A further object is to enable detection of trigger actions without
modifying the multimedia signal.

This is achieved by a method (and corresponding device) of relating one or
more trigger actions with a multimedia signal, the method comprising the steps of:

- providing at least one trigger time point and for each trigger time point providing at
least one representation of at least one associated trigger action, where each trigger time
point indicates a time point of the multimedia signal for which the at least one
associated trigger action is to be available during playback of the multimedia signal,
and

- for each given trigger time point deriving a fingerprint on the basis of a segment of
the multimedia signal, where the segment of the multimedia signal is unambiguously
related with the given trigger time point,

and by a method (and corresponding device) of detecting one or more trigger actions
in a multimedia signal, the method comprising the steps of:

- generating a fingerprint stream on the basis of the multimedia signal,

- comparing a segment of the fingerprint stream with one or more fingerprints
stored in a second database in order to determine if a match exists or not between
the segment and a fingerprint in the second database, the second database further
comprising for each stored fingerprint at least one representation of at least one
associated action, and

- if a match exists, retrieving the at least one representation of the at least one
associated action associated with the matching fingerprint.
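
The detection steps listed above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the fingerprint stream is modelled as a sequence of integer sub-fingerprints, the database as a dictionary, and matching is exact. The names `detect_actions`, `Fingerprint` and the action labels are hypothetical and do not come from the description.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed encoding: a stored fingerprint is a tuple of integer sub-fingerprints.
Fingerprint = Tuple[int, ...]


def detect_actions(
    stream: Sequence[int],
    database: Dict[Fingerprint, List[str]],
) -> List[Tuple[int, List[str]]]:
    """Compare segments of the fingerprint stream with the stored fingerprints.

    Returns (index of the last stream element of the matching segment, actions)
    pairs. Exact matching is an assumption made to keep the sketch short.
    """
    hits = []
    lengths = sorted({len(fp) for fp in database})
    for end in range(1, len(stream) + 1):
        for n in lengths:
            if n > end:
                break  # not enough stream received yet for longer fingerprints
            segment = tuple(stream[end - n:end])
            actions = database.get(segment)
            if actions is not None:
                hits.append((end - 1, actions))
    return hits
```

In a real playback device the lookup would tolerate bit errors rather than require equality; the dictionary lookup merely stands in for the comparison step.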

In this way, a simple and efficient way of handling time markers in a
multimedia signal for given actions is obtained. A fingerprint thereby serves as a trigger of a
particular action, event, etc. associated with a particular point in time of the multimedia
signal. Further, this is enabled without the multimedia signal needing to be modified.
Additionally, the time marking detection is time independent, as it depends only on the
specific content of the multimedia signal, thereby avoiding problems if a multimedia
signal such as a television program or the like is delayed.

A fingerprint of a multimedia object/content/signal is a representation of
perceptual features of the object/content/signal part in question. Such fingerprints are
sometimes also known as "(robust) hashes". More specifically, a fingerprint of a piece of
audio or video is an identifier which is computed over that piece of audio or video and which
does not change even if the content involved is subsequently transcoded, filtered or otherwise
modified.

Preferably, the derived fingerprint is an audio and/or video fingerprint.
Alternatively, animations and/or streaming text, etc. are used as a source for creating a
fingerprint.

Advantageous embodiments of the methods and devices according to the
present invention are defined in the sub-claims.

Further, the invention also relates to a computer readable medium having 
stored thereon instructions for causing one or more processing units to execute the method 
according to the present invention. 


Figure 1a schematically illustrates generation of fingerprint(s) used as trigger
marker(s) according to the present invention.

Figure 1b schematically illustrates detection and use of fingerprint(s) as trigger
marker(s) according to the present invention.

Figure 2 illustrates a schematic block diagram of a fingerprint generation
device according to the present invention;

Figure 3 illustrates a schematic block diagram of a playback device detecting
and using fingerprints according to the present invention;

Figure 4 illustrates one example of tables/records according to the present
invention.



Figure 1a schematically illustrates generation of fingerprint(s) used as trigger
marker(s) according to the present invention.
Shown is a digital or analog multimedia signal (101) comprising video and/or
audio information/content, where one or more 'trigger' actions (henceforth denoted actions) are to
be associated/related with the multimedia signal (101) at certain given 'trigger' time points
(henceforth denoted time points). The one or more actions associated with each time point are to be
available, i.e. triggered, at that given particular time point (Tn; Tn+1) during playback on a
playback device. The notation '(Tn; Tn+1)' for a given time point signifies that the time point
may be either the shown time point Tn or the shown time point Tn+1 or in general any suitable
(not shown) time point of the signal (101). The associated actions of multiple time points
may be the same, different and/or a mix thereof.
The action(s) to be presented/triggered at a given time point may e.g. comprise
retrieving and displaying additional information on a display (e.g. presenting biography data
for an actor being shown by the multimedia signal, presenting a selectable URL to a web site
containing additional information, etc.), retrieving and playing additional information via a
speaker, playing another multimedia signal instead of said multimedia signal (101) for a
predetermined or variable period of time (e.g. a local weather forecast, a local commercial,
etc.) and/or the like. Other examples of action(s) are e.g. stopping/pausing, e.g. temporarily,
display/play, executing other control commands, and/or preparing the system for user
input(s), e.g. once the trigger action is executed the system waits (for some time) for a
specific action of the user. If the trigger action was not executed, the user input will not have
any influence. For example, in interactive games the user may only submit his answer after
the trigger action has fired/been executed.

For each time point (Tn; Tn+1) a fingerprint (102) is generated on the basis of a
part, segment, etc. (henceforth denoted segment) of the multimedia signal (101), where the
segment of the multimedia signal (101) is unambiguously related with the given time point
(Tn; Tn+1). Preferably, the segment of the multimedia signal (101) is unambiguously related
with the given time point (Tn; Tn+1) by letting the segment of the multimedia signal (101)
end substantially at the given time point (Tn; Tn+1). In alternative embodiments, the
segment of the multimedia signal (101) may start substantially at the given time point (Tn;
Tn+1), the segment of the multimedia signal (101) may start or end at a predetermined
distance before or after the given trigger time point (Tn; Tn+1), or the given time point (Tn;
Tn+1) may be at a predetermined time point between a start and an end of the segment of the
multimedia signal (101).
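
The alternatives above differ only in where the time point sits relative to the segment, so they can be illustrated with a single offset. The following is a minimal sketch; the `Alignment` class, its field names and the use of seconds as the unit are assumptions made for illustration, not part of the description.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Alignment:
    """Hypothetical description of how a segment relates to its time point.

    length: duration of the segment in seconds.
    offset: position of the time point measured from the segment start.
        offset == length  -> the segment ends at the time point (preferred case)
        offset == 0       -> the segment starts at the time point
        0 < offset < length -> the time point lies inside the segment
        offset > length   -> the segment ends a fixed distance before the time point
        offset < 0        -> the segment starts a fixed distance after the time point
    """
    length: float
    offset: float


def segment_bounds(trigger_time: float, alignment: Alignment) -> Tuple[float, float]:
    """Return (start, end) of the segment to fingerprint for a given time point."""
    start = trigger_time - alignment.offset
    return (start, start + alignment.length)
```

For example, `segment_bounds(10.0, Alignment(length=3.0, offset=3.0))` selects the three seconds that end at the time point.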

The fingerprints and/or the segments may each be of a
predetermined fixed size or alternatively of a variable size.
One method for computing a robust fingerprint is described in European patent
application 01200505.4, although of course any method for computing a robust fingerprint
can be used.

European patent application 01200505.4 describes a method that generates
robust fingerprints for multimedia content such as, for example, audio clips, where the audio
clip is divided in successive (preferably overlapping) time intervals. For each time interval,
the frequency spectrum is divided in bands. A robust property of each band (e.g. energy) is
computed and represented by a respective fingerprint bit.

Multimedia content is thus represented by a fingerprint comprising a
concatenation of binary values, one for each time interval. The fingerprint does not need to
be computed over the whole multimedia content, but can be computed when a portion of a
certain length has been received. There can thus be plural fingerprints for one multimedia
content, depending on which portion the fingerprint is computed over.
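
The interval/band scheme summarised above can be illustrated roughly as follows. This is a sketch under assumptions, not the method of the cited application: the frame length, hop size and number of bands are arbitrary, the transform is a naive DFT, and the bit rule (a bit is 1 when a band holds more energy than the next higher band) is chosen only for illustration.

```python
import cmath
import math
from typing import List, Sequence


def band_energies(frame: Sequence[float], n_bands: int) -> List[float]:
    """Energy per frequency band, from a naive DFT over the positive bins."""
    n = len(frame)
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        power.append(abs(coeff) ** 2)
    width = len(power) // n_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]


def fingerprint(samples: Sequence[float], frame_len: int = 32,
                hop: int = 16, n_bands: int = 4) -> List[List[int]]:
    """One list of bits per overlapping time interval.

    Assumed bit rule: bit b is 1 if band b holds more energy than band b + 1.
    """
    bits = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        energy = band_energies(samples[start:start + frame_len], n_bands + 1)
        bits.append([1 if energy[b] > energy[b + 1] else 0
                     for b in range(n_bands)])
    return bits
```

Because the assumed bit rule compares energies rather than using absolute levels, scaling the signal amplitude leaves the bits unchanged, which is one simple sense in which such a property is robust.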

Further, video fingerprinting algorithms are known, e.g. from the following
disclosure: Job Oostveen, Ton Kalker, Jaap Haitsma: "Feature Extraction and a Database
Strategy for Video Fingerprinting", pp. 117-128. In: Shi-Kuo Chang, Zhe Chen, Suh-Yin Lee
(Eds.): Recent Advances in Visual Information Systems, 5th International Conference,
VISUAL 2002, Hsin Chu, Taiwan, March 11-13, 2002, Proceedings. Lecture Notes in
Computer Science 2314, Springer 2002.
According to the present invention, a fingerprint (102) is generated for each
time point on the basis of a given segment of the multimedia signal (101) at or near the
specific time point.

In this way, a given fingerprint (102) is a trigger marker enabling a very
accurate and very precise location of a given time point of the signal (101) without using the
specific time point itself but instead using (a segment of) the signal. Further, this is enabled
without changing the signal. For video fingerprinting the localisation is typically frame
accurate, at least as long as any distortion of the video signal is not too severe.

After a fingerprint (102) has been generated it is stored for later use in a
database, memory, storage and/or the like.

There are several advantages in storing fingerprints (102) for a multimedia
signal (101) in a database instead of the multimedia signal itself. To name a few:

- The memory/storage requirements for the database are reduced.

- The comparison of fingerprints is more efficient than the comparison of the
multimedia signals themselves, as fingerprints are substantially shorter than the signals.

- Searching in a database for a matching fingerprint is more efficient than
searching for a complete multimedia signal, since it involves matching shorter items.

- Searching for a matching fingerprint is more likely to be successful, as small
changes to a multimedia signal (such as encoding in a different format or changing the bit
rate) do not affect the fingerprint.
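
In practice a robust fingerprint may still differ in a few bits after re-encoding, so a search typically accepts the closest stored fingerprint within some tolerance. The sketch below assumes fingerprints packed into Python integers and a Hamming-distance threshold; the function names and the `max_bit_errors` parameter are hypothetical and not taken from the description.

```python
from typing import Dict, Optional


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints packed as integers."""
    return bin(a ^ b).count("1")


def best_match(query: int, stored: Dict[str, int],
               max_bit_errors: int) -> Optional[str]:
    """Return the id of the closest stored fingerprint, or None if none is
    within max_bit_errors differing bits of the query."""
    best_id, best_dist = None, max_bit_errors + 1
    for fp_id, fp in stored.items():
        dist = hamming(query, fp)
        if dist < best_dist:
            best_id, best_dist = fp_id, dist
    return best_id
```

Comparing two short bit strings in this way is far cheaper than comparing the underlying audio or video, which is the efficiency argument made above.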

Alternatively, if the multimedia content is in the form of more than a single
signal, e.g. a separate audio signal and a separate video signal, the fingerprint(s) (102) may
be generated on the basis of one of them (audio or video) or on both.

The generated fingerprints (102) stored in the database may then be distributed
to playback devices via the Internet or in a side-channel of a broadcast channel or via some
other channel or other means for use during playback according to the present invention.
Other examples of distribution are physical distribution on a storage medium, or distribution
in a non-electronic way, e.g. requiring the user to enter the fingerprints and actions manually
into the playback device.

In a preferred embodiment, a representation of the associated action(s) is also
stored for each fingerprint in the database. These representations are preferably also sent to
the playback devices. In an alternative embodiment, representations are not stored in the
database or used at all when generating the fingerprints. Another party may then provide the
representations to the relevant playback devices as well as a relationship between each
fingerprint and its associated action(s).

Figure 1b schematically illustrates detection and use of fingerprint(s) as trigger
marker(s) according to the present invention. Shown is a digital or analog multimedia signal
(101) comprising video and/or audio information/content, where the signal (101) is played
back by a suitable playback device. Further shown is a fingerprint stream (104) that is
generated continuously or substantially continuously on the basis of the multimedia signal
(101). Alternatively, the fingerprint stream (104) is generated in segments. The fingerprint
stream (104) (or segments) is compared with fingerprints (102) stored in a database. The
stored fingerprints (102) are generated as explained in connection with Figure 1a at a production
site. The database preferably also comprises representations of the one or more associated
actions (105) for each stored fingerprint (102). The stored fingerprints (102) are e.g. received
via the Internet or in a side-channel of a broadcast channel or via some other channel or other
means from the distribution site. The representations of the associated action(s) (105) may
also be received like this. In an alternative embodiment, the representations as well as a
relationship between each fingerprint and its associated action(s) (105) are provided by
another party.

When a match between a segment of the fingerprint stream (104) and a given
fingerprint (102) in the database is found, the representation(s) of the associated action(s)
(105) of that particular fingerprint (102) are retrieved and executed at the appropriate time
point (Tn; Tn+1). When a match between a segment of the fingerprint stream (104) and a
fingerprint (102) in the database is found, the appropriate time point (Tn; Tn+1) is also determined
when the fingerprints (102) have been generated as explained in connection with Figure 1a.
Preferably, the given time point (Tn; Tn+1) is determined by letting the segment of the
multimedia signal (101) that the matching fingerprint originally has been based on during
generation (according to Figure 1a) end substantially at the given time point (Tn; Tn+1). In
alternative embodiments, the segment of the multimedia signal (101) may start substantially
at the given time point (Tn; Tn+1), the segment of the multimedia signal (101) may start or
end at a predetermined distance before or after the given trigger time point (Tn; Tn+1), or
the given time point (Tn; Tn+1) may be at a predetermined time point between a start and an
end of the segment of the multimedia signal (101). The playback device simply needs to be
aware of the relationship between a given fingerprint and the given time point used during
generation.
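
On the playback side, the time point follows from the position of the matched segment and the relationship used during generation. The sketch below shows one way this could be expressed; the relation names, the `distance` parameter and the use of seconds are assumptions made for illustration only.

```python
def trigger_time(match_start: float, match_length: float,
                 relation: str, distance: float = 0.0) -> float:
    """Derive the trigger time point from a matched segment.

    relation (hypothetical labels) names the relationship used at generation:
      "ends_at"      - segment ends at the time point (preferred case)
      "starts_at"    - segment starts at the time point
      "ends_before"  - segment ends `distance` seconds before the time point
      "starts_after" - segment starts `distance` seconds after the time point
      "inside"       - time point lies `distance` seconds after the segment start
    """
    if relation == "ends_at":
        return match_start + match_length
    if relation == "starts_at":
        return match_start
    if relation == "ends_before":
        return match_start + match_length + distance
    if relation == "starts_after":
        return match_start - distance
    if relation == "inside":
        return match_start + distance
    raise ValueError(f"unknown relation: {relation!r}")
```

Note that with "starts_at" or "starts_after" the time point precedes the end of the matched segment, so a device using those relations would need to buffer the signal before displaying it.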

When a matching fingerprint (102) is determined, the associated one or more
actions are also retrieved. The playback device may then execute these actions or present them
to a user, e.g. awaiting user confirmation before executing them.

The above-mentioned European patent application 01200505.4 describes
various matching strategies for matching fingerprints computed for an audio clip with
fingerprints stored in a database.

Further, European patent application 01202720.7 describes an efficient method
of matching a fingerprint representing an unknown information signal with a plurality of
fingerprints of identified information signals stored in a database to identify the unknown
signal. This method uses reliability information of the extracted fingerprint bits. The
fingerprint bits are determined by computing features of an information signal and
thresholding said features to obtain the fingerprint bits. If a feature has a value very close to
the threshold, a small change in the signal may lead to a fingerprint bit with opposite value.
The absolute value of the difference between feature value and threshold is used to mark each
fingerprint bit as reliable or unreliable. The reliabilities are subsequently used to improve the
actual matching procedure.
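
The reliability idea can be illustrated as follows. The thresholding and the reliability mark follow the description above; how the reliabilities are then used is not spelled out here, so ignoring unreliable bits during comparison is an assumption of this sketch, as are the names `extract_bits`, `matches`, `margin` and `max_errors`.

```python
from typing import List, Sequence, Tuple


def extract_bits(features: Sequence[float], threshold: float,
                 margin: float) -> Tuple[List[int], List[bool]]:
    """Threshold the features into bits and mark each bit reliable or not.

    A bit is marked reliable when |feature - threshold| >= margin
    (the margin value is an assumed tuning parameter).
    """
    bits = [1 if f > threshold else 0 for f in features]
    reliable = [abs(f - threshold) >= margin for f in features]
    return bits, reliable


def matches(query_bits: Sequence[int], reliable: Sequence[bool],
            stored_bits: Sequence[int], max_errors: int = 0) -> bool:
    """Compare only the reliable query bits against a stored fingerprint."""
    errors = sum(1 for q, r, s in zip(query_bits, reliable, stored_bits)
                 if r and q != s)
    return errors <= max_errors
```

A bit whose feature value sits close to the threshold is the one most likely to flip under distortion, so excluding it from the error count avoids rejecting an otherwise good match.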
A further advantage of the present invention is that if for any reason the
broadcast is delayed, the fingerprint matching ensures that the trigger actions still appear at
the correct corresponding moment in the broadcast since the invention is time-independent
but content-dependent.

Figure 2 illustrates a schematic block diagram of a fingerprint generation
device according to the present invention. Shown is a fingerprint generation device (200)
comprising a multi-media signal input module (201), a fingerprinting module (202), and a
database, memory, storage and/or the like (203) communicating via a bus (205) or the like under
the control of one or more microprocessors (not shown). The fingerprint generation device
(200) may in one embodiment optionally also comprise a transmitter and receiver (204) for
communicating with other systems, devices, etc. via a wired and/or wireless network, e.g.
the Internet.

The multi-media signal input module (201) receives multimedia content e.g. in
the form of an analog or digital audio and/or video signal and feeds the multimedia content to
the fingerprinting module (202). The fingerprinting module (202) computes a fingerprint on
the basis of the received multi-media content. A fingerprint may be derived for the entire
content or for a part of the content. Alternatively, several fingerprints may be derived each
from a different part. According to the present invention, a fingerprint is derived each time
that a trigger action is needed, i.e. for each time point (Tn; Tn+1), as explained in connection
with Figure 1a. A representation of the time point(s) is also supplied to the fingerprinting
module (202).

The fingerprinting module (202) then supplies the computed fingerprint(s) to
the database (203) preferably together with the associated one or more actions for each
fingerprint. As shown in Figure 4, the database (203) comprises fingerprints 'FP1', 'FP2',
'FP3', 'FP4', 'FP5', etc. and respective associated actions 'A1', 'A2', 'A3', 'A4', 'A2, A1',
etc.

The database (203) can be organized in various ways to optimize query time
and/or data organization. The output of the fingerprinting module (202) should be taken into
account when designing the tables in the database (203). In the embodiment shown in Figure
4, the database (203) comprises a single table with entries (records) comprising respective
fingerprints and associated (sets of) actions.

Figure 3 illustrates a schematic block diagram of a playback device detecting
and using fingerprints according to the present invention. Shown is a playback device (300)
comprising a multimedia signal receiver (301), a fingerprint detector (302), a display/play
circuit (303), and a database, memory, storage and/or the like (203') communicating via a bus
(205) or the like under the control of one or more microprocessors (not shown). The playback
device (300) may in one embodiment optionally also comprise a transmitter and receiver
(204) for communicating with other systems, devices, etc. via a wired and/or wireless
network, e.g. the Internet.

The multimedia signal receiver (301) receives the multimedia signal e.g. in the
form of an analog or digital audio and/or video signal to be displayed and/or played e.g. from
a broadcasting cable, antenna, satellite dish, etc. arrangement (not shown). The received
multimedia signal is fed to the fingerprint detector (302) that derives a fingerprint stream or
segments thereof and determines if there are any matches with fingerprints stored in the
database as explained in connection with Figure 1b. If a match is found then a representation
of the associated action(s) is also retrieved. The appropriate time point for the associated
action(s) is given by the matching fingerprint as described above.

The received multimedia signal is displayed and/or played by the
display/play circuit (303) and at the appropriate time point(s) the associated action(s) are
executed or presented to a user, e.g. awaiting user confirmation before executing the action(s).

Preferably, the data layout of the database (203') corresponds to the one 
shown in Figure 4. 

The playback device (300) may also comprise a buffer mechanism (not
shown) for buffering a part of the multimedia signal before displaying/playing it.

Figure 4 illustrates one example of tables/records according to the present
invention. Shown is a table comprising fingerprints (102) 'FP1', 'FP2', 'FP3', 'FP4', 'FP5',
etc. and respective associated actions (105) 'A1', 'A2', 'A3', 'A4', 'A2, A1', etc. One or
more actions (105) are stored for each fingerprint (102). A given fingerprint (102) is only
stored in the table once.
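
A single table of this kind maps naturally onto a dictionary keyed by fingerprint. The sketch below is illustrative only: the `TriggerTable` class and its method names are hypothetical, and the fingerprints and actions are represented by the placeholder labels used in Figure 4.

```python
from typing import Dict, List, Sequence


class TriggerTable:
    """Single table: each fingerprint appears once, with one or more actions."""

    def __init__(self) -> None:
        self._rows: Dict[str, List[str]] = {}

    def add(self, fingerprint: str, actions: Sequence[str]) -> None:
        if fingerprint in self._rows:
            raise ValueError("a given fingerprint is only stored once")
        if not actions:
            raise ValueError("at least one action is required per fingerprint")
        self._rows[fingerprint] = list(actions)

    def actions_for(self, fingerprint: str) -> List[str]:
        """Return a copy of the stored actions, or an empty list if unknown."""
        return list(self._rows.get(fingerprint, []))


# Populate the table with the example rows of Figure 4.
table = TriggerTable()
for fp, acts in [("FP1", ["A1"]), ("FP2", ["A2"]), ("FP3", ["A3"]),
                 ("FP4", ["A4"]), ("FP5", ["A2", "A1"])]:
    table.add(fp, acts)
```

Keying on the fingerprint enforces the once-only rule, while the list value allows the same action (here 'A1' and 'A2') to be shared by several fingerprints.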

In the claims, any reference signs placed between parentheses shall not be
construed as limiting the claim. The word "comprising" does not exclude the presence of
elements or steps other than those listed in a claim. The word "a" or "an" preceding an
element does not exclude the presence of a plurality of such elements.

The invention can be implemented by means of hardware comprising several
distinct elements, and by means of a suitably programmed computer. In the device claim
enumerating several means, several of these means can be embodied by one and the same
item of hardware. The mere fact that certain measures are recited in mutually different
dependent claims does not indicate that a combination of these measures cannot be used to
advantage.



